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Abstract — We consider the problem of sparse canonical corre- 
lation analysis (CCA), i.e., the search for two linear combinations, 
one for each multivariate, that yield maximum correlation using a 
specified number of variables. We propose an efficient numerical 
approximation based on a direct greedy approach which bounds 
the correlation at each stage. The method is specifically designed 
to cope with large data sets and its computational complexity 
depends only on the sparsity levels. We analyze the algorithm's 
performance through the tradeoff between correlation and par- 
simony. The results of numerical simulation suggest that a 
significant portion of the correlation may be captured using a 
relatively small number of variables. In addition, we examine the 
use of sparse CCA as a regularization method when the number 
of available samples is small compared to the dimensions of the 
multivariates. 



I. Introduction 

Canonical correlation analysis (CCA), introduced by Harold 
Hotelling [1], is a standard technique in multivariate data anal- 
ysis for extracting common features from a pair of data sources 
[2], [3]. Each of these data sources generates a random vector 
that we call a multivariate. Unlike classical dimensionality 
reduction methods which address one multivariate, CCA takes 
into account the statistical relations between samples from 
two spaces of possibly different dimensions and structure. In 
particular, it searches for two linear combinations, one for 
each multivariate, in order to maximize their correlation. It 
is used in different disciplines as a stand-alone tool or as a 
preprocessing step for other statistical methods. Furthermore, 
CCA is a generalized framework which includes numerous 
classical methods in statistics, e.g.. Principal Component 
Analysis (PCA), Partial Least Squares (PLS) and Multiple 
Linear Regression (MLR) [4]. CCA has recently regained 
attention with the advent of kernel CCA and its application 
to independent component analysis [5], [6]. 

The last decade has witnessed a growing interest in the 
search for sparse representations of signals and sparse nu- 
merical methods. Thus, we consider the problem of sparse 
CCA, i.e., the search for linear combinations with maximal 
correlation using a small number of variables. The quest 
for sparsity can be motivated through various reasonings. 
First is the ability to interpret and visualize the results. A 
small number of variables allows us to get the "big picture", 
while sacrificing some of the small details. Moreover, sparse 
representations enable the use of computationally efficient 
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numerical methods, compression techniques, as well as noise 
reduction algorithms. The second motivation for sparsity is 
regularization and stability. One of the main vulnerabilities of 
CCA is its sensitivity to a small number of observations. Thus, 
regularized methods such as ridge CCA [7] must be used. In 
this context, sparse CCA is a subset selection scheme which 
allows us to reduce the dimensions of the vectors and obtain 
a stable solution. 

To the best of our knowledge the first reference to sparse 
CCA appeared in [2] where backward and stepwise subset 
selection were proposed. This discussion was of qualitative 
nature and no specific numerical algorithm was proposed. 
Recently, increasing demands for multidimensional data pro- 
cessing and decreasing computational cost has caused the 
topic to rise to prominence once again [8]-[13]. The main 
disadvantages with these current solutions is that there is no 
direct control over the sparsity and it is difficult (and non- 
intuitive) to select their optimal hyperparameters. In addition, 
the computational complexity of most of these methods is too 
high for practical applications with high dimensional data sets. 
Sparse CCA has also been implicitly addressed in [9], [14] and 
is intimately related to the recent results on sparse PCA [9], 
[15]-[17]. Indeed, our proposed solution is an extension of the 
results in [17] to CCA. 

The main contribution of this work is twofold. First, we 
derive CCA algorithms with direct control over the sparsity 
in each of the multivariates and examine their performance. 
Our computationally efficient methods are specifically aimed 
at understanding the relations between two data sets of large 
dimensions. We adopt a forward (or backward) greedy ap- 
proach which is based on sequentially picking (or dropping) 
variables. At each stage, we bound the optimal CCA solution 
and bypass the need to resolve the full problem. Moreover, 
the computational complexity of the forward greedy method 
does not depend on the dimensions of the data but only on the 
sparsity parameters. Numerical simulation results show that a 
significant portion of the correlation can be efficiently captured 
using a relatively low number of non-zero coefficients. Our 
second contribution is investigation of sparse CCA as a reg- 
ularization method. Using empirical simulations we examine 
the use of the different algorithms when the dimensions of 
the multivariates are larger than (or of the same order of) 
the number of samples and demonstrate the advantage of 
sparse CCA. In this context, one of the advantages of the 
greedy approach is that it generates the full sparsity path in 
a single run and allows for efficient parameter tuning using 
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cross validation. 

The paper is organized as follows. We begin by describing 
the standard CCA formulation and solution in Section 11. 
Sparse CCA is addressed in Section 111 where we review the 
existing approaches and derive the proposed greedy method. In 
Section IV, we provide performance analysis using numerical 
simulations and assess the tradeoff between correlation and 
parsimony, as well as its use in regularization. Finally, a 
discussion is provided in Section V. 

The following notation is used. Boldface upper case letters 
denote matrices, boldface lower case letters denote colunm 
vectors, and standard lower case letters denote scalars. The 
superscripts (•)^ and (•)~^ denote the transpose and inverse 
operators, respectively. By I we denote the identity matrix. 
The operator || • ||2 denotes the L2 norm, and || • ||o denotes the 
cardinality operator. For two sets of indices I and J, the matrix 
X^''^ denotes the submatrix of X with the rows indexed by / 
and colunms indexed by J. Finally, X :^ or X ^ means 
that the matrix X is positive definite or positive semidefinite, 
respectively. 

II. Review on CCA 

In this section, we provide a review of classical CCA. Let 
X and y be two zero mean random vectors of lengths n and 
m, respectively, with joint covariance matrix: 



S = 



to, 



(1) 



where S,., and S^.^, are the covariance of x, the covariance 
of y, and their cross covariance, respectively. CCA considers 
the problem of finding two linear combinations X = a^x and 
Y = b^y with maximal correlation defined as 



cov{X, Y} 



(2) 



where var{ } and cov{ } are the variance and covariance 
operators, respectively, and we define 0/0 = 1. In terms of 
a and b the correlation can be easily expressed as 



Va^S^a^b'^'Syb' 
Thus, CCA considers the following optimization problem 



(3) 



max 



a Sj^yb 



(4) 



Problem (4) is a multidimensional non-concave maximization 
and therefore appears difficult on first sight. However, it has 
a simple closed form solution via the generalized eigenvalue 
decomposition (GEVD). Indeed, if S ;^ 0, it is easy to show 
that the optimal a and b must satisfy: 
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(5) 



for some < A < 1. Thus, the optimal value of (4) is just the 
principal generalized eigenvalue Amax of the pencil (5) and 
the optimal solution a and b can be obtained by appropriately 
partitioning the associated eigenvector. These solutions are 



invariant to scaling of a and b and it is customary to normalize 
them such that a^S^^a = 1 and b^S^b = 1. On the other 
hand, if S is rank deficient, then choosing a and b as the 
upper and lower partitions of any vector in its null space will 
lead to full correlation, i.e., p = 1. 

In practice, the covariance matrices Tix and and cross 
covariance matrix 'Sj.y are usually unavailable. Instead, mul- 
tiple independent observations Xj and yj for i = 1 . . . iV are 
measured and used to construct sample estimates of the (cross) 
covariance matrices: 



— X'^X, ty 



1 



rX^Y, 



^xy = 



where X = [xi...xjv] and Y = [yi...yiv]- Then, these 
empirical matrices are used in the CCA formulation: 



a Sjiyb 



Va^S^a^b^Sj/b 



(7) 



Clearly, if TV is sufficiently large then this sample approach 
performs well. However, in many apphcations, the number of 
samples A'' is not sufficient. In fact, in the extreme case in 
which N < n + m the sample covariance is rank deficient and 
p = I independently of the data. The standard approach in 
such cases is to regularize the covariance matrices and solve 
p (llx + fyl, '^y + Ecnl, '^xy'^ where Cx > and > are 
small tuning ridge parameters [7]. 

CCA can be viewed as a unified framework for dimensional- 
ity reduction in multivariate data analysis and generalizes other 
existing methods. It is a generalization of PCA which seeks the 
directions that maximize the variance of x, and addresses the 
directions corresponding to the correlation between x and y. A 
special case of CCA is PLS which maximizes the covariance 
of X and y (which is equivalent to choosing S^. = Sj, = I). 
Similarly, MLR normalizes only one of the multivariates. In 
fact, the regularized CCA mentioned above can be interpreted 
as a combination of PLS and CCA [4]. 

111. Sparse CCA 

We consider the problem of sparse CCA, i.e., finding a pair 
of linear combinations of x and y with prescribed cardinaUty 
which maximize the correlation. Mathematically, we define 
sparse CCA as the solution to 



maxa^o,b#o 
s.t. 



y/aT-S^SLy/hTTlyb 

||a||o < ka 
\Mo<ki,. 



(8) 



Similarly, sparse PLS and sparse MLR are defined as special 
cases of (8) by choosing Y^x = I and/or Sy = I. In general, 
all of these problems are difficult combinatorial problems. 
In small dimensions, they can be solved using a brute force 
search over all possible sparsity patterns and solving the as- 
sociated subproblem via GEVD. Unfortunately, this approach 
is impractical for even moderate sizes of data sets due its 
exponentially increasing computational complexity. In fact, it 
is a generalization of sparse PCA which has been proven NP- 
hard [17]. Thus, suboptimal but efficient approaches are in 
order and will be discussed in the rest of this section. 
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A. Existing solutions 

We now briefly review the different approaches to sparse 
CCA that appeared in the last few years. Most of the methods 
are based on the well known LASSO trick in which the 
difficult combinatorial cardinality constraints are approximated 
through the convex LI norm. This approach has shown promis- 
ing performance in the context of sparse linear regression 
[18]. Unfortunately, it is not sufficient in the CCA formulation 
since the objective itself is not concave. Thus additional 
approximations are required to transform the problems into 
tractable form. 

Sparse dimensionahty reduction of rectangular matrices 
was considered in [9] by combining the LASSO trick with 
semidefinite relaxation. In our context, this is exactly sparse 
PLS which is a special case of sparse CCA. Alternatively, CCA 
can be formulated as two constrained simultaneous regressions 
(x on y, and y on x). Thus, an appealing approach to sparse 
CCA is to use LASSO penalized regressions. Based on this 
idea, [8] proposed to approximate the non convex constraints 
using the infinity norm. Similarly, [10], [11] proposed to use 
two nested iterative LASSO type regressions. 

There are two main disadvantages to the LASSO based 
techniques. First, there is no mathematical justification for 
their approximations of the correlation objective. Second, there 
is no direct control over sparsity. Their parameter tuning is 
difficult as the relation between the LI norm and the sparsity 
parameters is highly nonlinear. The algorithms need to be run 
for each possible value of the parameters and it is tedious to 
obtain the full sparsity path. 

An alternative approach for sparse CCA is sparse Bayes 
learning [12], [13]. These methods are based on the prob- 
abilistic interpretation of CCA, i.e., its formulation as an 
estimation problem. It was shown that using different prior 
probabilistic models, sparse solutions can be obtained. The 
main disadvantage of this approach is again the lack of direct 
control on sparsity, and the difficulty in obtaining its complete 
sparsity path. 

Altogether, these works demonstrate the growing interest in 
deriving efficient sparse CCA algorithms aimed at large data 
sets with simple and intuitive parameter tuning. 

B. Greedy approach 

A standard approach to combinatorial problems is the for- 
ward (backward) greedy solution which sequentially picks (or 
drops) the variables at each stage one by one. The backward 
greedy approach to CCA was proposed in [2] but no specific 
algorithm was derived or analyzed. In modem apphcations, the 
number of dimensions may be much larger than the number 
of samples and therefore we provide the details of the more 
natural forward strategy. Nonetheless, the backward approach 
can be derived in a straightforward manner. In addition, we 
derive an efficient approximation to the subproblems at each 
stage which significantly reduces the computational complex- 
ity. A similar approach in the context of PCA can be found 
in [17]. 

Our goal is to find the two sparsity patterns, i.e., two sets 
of indices I and J corresponding to the indices of the chosen 



variables in x and y, respectively. The greedy algorithm 
chooses the first elements in both sets as the solution to 



(9) 



Thus, / = {i} and J = {j}. Next, the algorithm sequentially 
examines all the remaining indices and computes 




IUi,J\ 
xy ) 



and 



maxp(S^.^Sf^•.^-^S^'/^^•). 



(10) 



(11) 



Depending on whether (10) is greater or less than (11), we 
add index i or j to / or J, respectively. We emphasize that 
at each stage, only one index is added either to / or J. Once 
one of the set reaches its required size ka or k^, the algorithm 
continues to add indices only to the other set and terminates 
when this set reaches its required size as well. It outputs the 
full sparsity path, and returns ka — kh ~ I pairs of vectors 
associated with the sparsity patterns in each of the stages. 

The computational complexity of the algorithm is poly- 
nomial in the dimensions of the problem. At each of its 
i = 1, - ■ ■ ,ka + kb — 1 stages, the algorithm computes 
n + m — i CCA solutions as expressed in (10) and (11) in 
order to select the patterns for the next stage. It is therefore 
reasonable for small problems, but is still impractical for 
many applications. Instead, we now propose an alternative 
approach that computes only one CCA per stage and reduces 
the complexity significantly. 

An approximate greedy solution can be easily obtained by 
approximating (10) and (11) instead of solving them exactly. 
Consider for example (10) where index i is added to the 
set /. Let and b^'"^ denote the optimal solution to 

p{'Sy ,'Ei\f). In order to evaluate (10) we need to 
recalculate both a^^' '^ and b-'^'-^'''^ for each i ^ I. However, 
the previous b^''' is of the same dimension and still feasible. 
Thus, we can optimize only with respect to a^'-''''^ (whose di- 
mension has increased). This approach provides the following 
bounds: 

Lemma 1: Let a.^'^ and b^'-^ be the optimal solution to 
p{i:i'',-EJ'J,i:i'J) in (4). Then, 

P [^x i^y '^xy )^P [^x '^y -i^xyj+^i 

/ (S^./, ^-'^^^ S^'/^^) >p' (S^'^ 1^1'^ S^'-^) + 7j'^ 

(12) 

where 

i ~ r n T r -1-1 

X ^x 



(13) 
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and we assume that all the involved matrix inversions are 
nonsingular. 

Before proving the lemma, we note that it provides lower 
bounds on the increase in cross correlation due to including 
an additional element in x or y without the need of solving 
a full GEVD. Thus, we propose the following approximate 
greedy approach. For each sparsity pattern {/, J}, one CCA 
is computed via GEVD in order to obtain and b^'''. Then, 
the next sparsity pattern is obtained by adding the element that 
maximizes Sj'"^ or •jj''^ among i ^ I and j ^ J. 

Proof: We begin by rewriting (4) as a quadratic maxi- 
mization with constraints. Thus, we define a^''' and b^''' as 
the solution to 



-ij.j 



maxa,b ^ '^iy^ 
s.t. a^S^'^a = 1 

b^S:!-^b = 1. 



(14) 



Now, consider the problem when we add variable j to the set 
J: 



1,1 yiJUj.JUj yi/,JUj 
X 5 > -^xy 



) = <^ s.t. a^S^.^a = 1 

[ b^S^u^"^u^b = 1. 

(15) 



Clearly, the vector a = a"^ is still feasible (though not 
necessarily optimal) and yields a lower bound: 



(16) 



Changing variables b = [S^'^-' ^ b results in: 
- /yi7,7 y. Juj, Juj ■yi,Juj\ > / maxjg^ c'^b 

P{^x ,^y ,^xy ) ||b||2 = 1, 



where 



\\2, 



Using the Cauchy Schwartz inequality 
c'^b < lldhllbll 

we obtain 

^JUj,JUj 



(17) 



(18) 



(19) 



y.JUJ,JUJ yl 
y ' X 



> 



V I 



Therefore, 



.(20) 



\^xy] 



J,J 



(21) 



Finally, (12) is obtained by using the inversion formula for 
partitioned matrices and simplifying the terms. ■ 




1-zero coeftfctents 



Fig. 1. Correlation vs. Sparsity, n = m = 7. 



IV. Numerical results 

We now provide a few numerical examples illustrating the 
behavior of the greedy sparse CCA methods. In all of the 
simulations below, we implement the greedy methods using 
the bounds in Lemma 1. In the first experiment we evaluate 
the validity of the approximate greedy approach. In particular, 
we choose n = m = 7, and generate 200 independent random 
realizations of the joint covariance matrix using the Wishart 
distribution with 7+7 degrees of freedom. For each realization, 
we run the approximate greedy forward and backward algo- 
rithms and calculate the full sparsity path. For comparison, we 
also compute the optimal sparse solutions using an exhaustive 
search. The results are presented in Fig. 1 where the average 
correlation is plotted as a function of the number of variables 
(or non-zero coefficients). The greedy methods capture a 
significant portion of the possible correlation. As expected, the 
forward greedy approach outperforms the backward method 
when high sparsity is critical. On the other hand, the backward 
method is preferable if large values of correlation are required. 

In the second experiment we demonstrate the performance 
of the approximate forward greedy approach in a large scale 
problem. We present results for a representative (randomly 
generated) covariance matrix of sizes n = m = 1000. Fig. 2 
shows the full sparsity path of the greedy method. It is easy 
to see that about 90 percent of the CCA correlation value can 
be captured using only half of the variables. Furthermore, if 
we choose to capture only 80 percent of the full correlation, 
then about a quarter of the variables are sufficient. 

In the third set of simulations, we examine the use of sparse 
CCA algorithms as regularization methods when the number 
of samples is not sufficient to estimate the covariance matrix 
efficiently. For simpUcity, we restrict our attention to CCA and 
PLS (which can be interpreted as an extreme case of ridge 
CCA). In addition, we show results for an alternative method 
in which the sample covariance S^; and Tiy are approximated 
as diagonal matrices with the sample variances (which are 
easier to estimate). We refer to this method as Diagonal CCA 
(DCCA). In order to assess the regularization properties of 
CCA we used the following procedure. We randomly generate 
a single "true" covariance matrix S and use it throughout 
all the simulations. Then, we generate N random Gaussian 
samples of x and y and estimate S. We apply the three 
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Fig. 2. Correlation vs. Sparsity, n = m = 1000. 
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Fig. 3. Sparsity as a regularized CCA metliod, n = 10, m = 10 and 
N = 20. 

approximate greedy sparse algorithms, CCA, PLS and DCCA, 
using the sample covariance and obtain the estimates a and 
b. Finally, our performance measure is the "true" correlation 
value associated with the estimated weights which is defined 
as: 

a^S .,^b_^ _ (22) 

We then repeat the above procedure (using the same "true" 
covariance matrix) 500 times and present the average value 
of (22) over these Monte Carlo trials. Fig. 3 provides these 
averages as a function of parsimony for two representative 
reaUzations of the "true" covariance matrix. Examining the 
curves reveals that variable selection is indeed a promising 
regularization strategy. The average correlation increases with 
the number of variables until it reaches a peak. After this peak, 
the number of samples are not sufficient to estimate the full 
covariance and it is better to reduce the number of variables 
through sparsity. DCCA can also be slightly improved by using 
fewer variables, and it seems that PLS performs best with no 
subset selection. 

V. Discussion 
We considered the problem of sparse CCA and discussed its 
implementation aspects and statistical properties. In particular, 
we derived direct greedy methods which are specifically 
designed to cope with large data sets. Similar to state of the 



art sparse regression methods, e.g.. Least Angle Regression 
(LARS) [19], the algorithms allow for direct control over the 
sparsity and provide the full sparsity path in a single run. 
We have demonstrated their performance advantage through 
numerical simulations. 

There are a few interesting directions for future research 
in sparse CCA. First, we have only addressed the first order 
sparse canonical components. In many appUcations, analysis 
of higher order canonical components is preferable. Numer- 
ically, this extension can be implemented by subtracting the 
first components and rerunning the algorithms. However, there 
remain interesting theoretical questions regarding the relations 
between the sparsity patterns of the different components. 
Second, while here we considered the case of a pair of mul- 
tivariates, it is possible to generalize the setting and address 
multivariate correlations between more than two data sets. 
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